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ABSTRACT 

The  numerical  solution  of  many  problems  in  continuum  dynamics  is  seriously 
limited  by  the  computation  rates  attainable  on  computers  with  serial  architecture. 
Parallel  processing  machines  can  achieve  much  higher  rates.  However,  applying 
additional  processors  to  a  calculation  is  only  part  of  the  solution.  This 
research  was  undertaken  to  develop  parallel  algorithms  for  explicit  and  implicit, 
Lagrangian  and  Eulerian  finite  difference  schemes  for  computational  continuum 
dynamics  in  one  spatial  dimension. 

First,  the  explicit  conservation  equations  in  the  Lagrangian  reference  frame 
were  readily  reformulated  for  concurrent  processing.  Second,  an  implicit  solution 
was  derived  for  these  equations.  This  is  important  because  it  yields  unconditional 
stability.  The  parallelism  is  achieved  bia  a  block  implicit  numerical  scheme. 

Third,  a  rezoning  algorithm  was  employed  with  each  Lagrangian  integration  stem 
to  transform  the  mesh  back  to  the  Eulerian  reference  frame.  The  algorithmic 
development  path  lead  to  a  parallelization  of  the  processing  in  blocks  of  the 
finite  difference  zones.  At  each  step  of  this  research  project,  the  derived  numerical 
methods  provided  effective  algorithms  for  exploiting  the  architectural  advantages 
of  the  HEP  H1000  (Heterogeneous  Element  Processor)  computer.  The  computational 
timing  data  shows  significant  speed-up  with  the  number  of  processes. 
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Simulation  codes.  The  conservation  laws  of  volume,  mass,  momentum,  and 
energy  apply  to  any  continuum  material:  solid,  liquid,  gas,  plasma,  or  multi¬ 
phase.  Hence,  the  algorithms  of  computational  continuum  dynamics  are  very 
important  for  the  solution  of  many  scientific  problems.  When  the  application 
is  changed  from  one  material  to  another,  only  the  material  law  (equation  of 
state,  constitutive  relation,  or  rate  relation)  changes.  Thus,  there  is  a 
similarity  of  structure  between  the  hydrocodes  (gas  and  liquid  dynamics),  the 
wavecodes  (solid  dynamics),  and  the  magnetohydrocodes  (plasma  dynamics)  that 
are  the  computer  implementations  of  schemes  for  continuum  dynamics  calcula¬ 
tions. 


It  is  well  known  that  computer  simulation  codes  are  cost-effective  tools 
in  continuum  dynamics  research.  Indeed,  a  variety  of  problems  arising  in  such 
fields  as  aeronautics,  controlled  fusion,  meteorology,  reactor  safety,  and 
structural  analysis  provide  strong  motivation  for  the  development  of  higher 
computing  rates.  However,  the  limits  of  current  computing  power  prevent  the 
simulation  of  many  important  problems  to  the  desired  levels  of  temporal  and 
spatial  resolution.  The  speed  of  light  barrier  imposes  a  theoretical  limit  on 
what  can  be  achieved  with  serial  architecture. 


Parallel  processing.  To  achieve  higher  computing  rates  it  has  become 
necessary  to  perform  calculations  in  parallel.  The  computer  architecture  with 
the  greatest  degree  of  parallelism  is  labeled  Multiple  Instruction  stream. 
Multiple  Data  stream  (MIMD) .  An  example  of  a  machine  of  this  type  is  the  HEP 
H1000  computer  manufactured  by  the  Dene! cor  Corporation. 

In  principle,  many  hydrocodes  and  wavecodes  could  be  moved  to  a  parallel 
processor.  However,  applying  additional  processors  to  a  computational  task  is 
not,  in  general,  sufficient  to  produce  significant  speed-up.  Indeed,  the 
development  of  parallel  algorithms  is  an  area  of  research  vital  to  the 
effectiveness  of  parallel  processors.*  Recent  research  indicates  that  the 
parallelization  of  a  program  should  be  organized  from  the  top  down.2*3  That 
is,  the  existing  structure  and  organization  of  a  program  impose  a  limitation 
on  the  achievable  Improvement  in  computation  time.  Consequently,  it  becomes 
necessary  to  reformulate  algorithms  and  to  write  new  code. 
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Parallel  algorithms.  The  direct  approach  to  the  construction  of  parallel 
algorithms  for  continuum  dynamics  calculations  can  be  quite  complicated. 

Rather  than  plunge  into  a  development  project  intended  to  generate  a  code  with 
broad  three-dimensional  capabilities,  we  have  taken  a  step-by-step  approach 
suggested  by  Darrell  Hicks. *  Thus,  in  a  modular  fashion,  the  algorithms  at 
each  level  of  complexity  could  be  verified  as  they  were  derived.  Our  approach 
to  the  numerical  solution  of  the  problems  of  continuum  dynamics  generated 
algorithms  well  suited  for  parallel  architecture  in  general  and  for  the  HEP 
computer  in  particular.  The  approach  is  based  on  a  progression  from  the 
simplest  hydrocode  (one-dimensional,  single-phase,  explicit,  Lagrangian) 
through  the  most  complex  continuum  dynamics  codes.  The  latter  programs  can 
involve  two  or  three  dimensions,  multiphase  material,  a  mixture  of  explicit 
and  implicit  differencing,  arbitrary  rezoning  coordinate  systems  (Lagrangian 
or  Eulerian),  or  variable  time  steps  from  one  spatial  region  to  another. 

Summary  of  results.  The  specific  objectives  of  the  research  reported  in 
this  paper  involve  a  three-step  process  for  the  development  of  parallel 
algorithms  for  one-dimensional  simulation  codes.5  First,  the  algorithms  were 
deduced  for  an  explicit,  one-dimensional,  single-phase  Lagrangian  hydrocode. 
Its  inherently  simple  data  structure  made  it  straightforward  to  integrate  the 
volume,  momentum,  and  energy  equations  for  each  zone,  or  block  of  zones,  in 
parallel.  However,  the  parallelization  of  the  Eulerian  and  certainly  the 
implicit  calculations  was  far  less  obvious.  For  the  second  step,  a  block- 
implicit  method  was  derived  to  allow  the  implicit  differencing  of  the  equa¬ 
tions.  The  separation  into  blocks  required  the  development  of  a  formula  for 
decoupling  the  inherently  interconnected  difference  equations. 

Third,  the  programs  were  converted  from  Lagrangian  to  Eulerian  coor¬ 
dinates  In  such  a  way  that  the  conserved  quantities  are  preserved  in  the  par¬ 
allel  processing  scheme.  The  conversions  from  Lagrangian  to  Eulerian  were 
achieved  via  a  rezoning  technique  that  has  precursors  in  the  work  of 
F.  Harlow5  (Particle  In  Cell  codes)  and  in  the  work  of  W.  Johnson  (PIC  codes 
converted  to  continuum  simulations).  The  method  is  referred  to  as  the  Harlow- 
Johnson  rezoning  technique.5 

Algorithms  that  permit  concurrent  processing  were  derived  for  each  of 
four  numerical  solutions  to  the  one-dimensional  hydrodynamics  equations. 

These  are  the  explicit  and  implicit,  Lagrangian  and  Eulerian  finite  difference 


equations.  In  fact,  somewhat  more  was  accomplished  because,  for  each  of  the 
Lagrangian  and  Eulerian  coordinate  systems,  the  explicit  and  implicit  dif¬ 
ferencing  can  be  blended  in  any  specified  ratio. 

At  all  three  steps  of  the  code  development  and  for  all  four  differencing 
schemes,  explicit  and  implicit  Lagrangian  and  Eulerian,  the  parallel 
algorithms  introduce  a  certain  amount  of  computational  inefficiency.  For  the 
explicit  differencing,  this  only  consists  of  the  overhead  associated  with  the 
initiation  of  the  parallel  paths  and  with  their  synchronization.  However,  the 
implicit  differencing  involves  extra  work  that  has  to  be  done  to  divide  the 
calculations  into  independent  segments  in  addition  to  the  overhead  due  to 
initializing  and  synchronizing. 

A  mathematical  model  was  developed  to  analyze  the  processing  time.  The 
parallel  algorithms  derived  for  this  project  achieve  a  nearly  optimal  paral¬ 
lelization  of  the  calculations  within  the  reference  frame  of  this  two- 
parameter  least  squares  fit  to  the  data. 

Foundations  for  future  work.  The  results  of  this  research  project  pro¬ 
vide  the  foundations  for  extensions  to  more  elaborate  hydrocodes.  Through 
stages  of  increasing  complexity  the  block-by-block  parallelization  of  multi¬ 
phase  and  multidimensional  hydrocodes  can  be  achieved.  The  simulation  of 
multiphase  or  multi  material  flow  is  readily  accomplished  in  an  Eulerian  coor¬ 
dinate  system.^  Discussions  of  two-phase  flow  models  and  their  relevance  to 
reactor  safety  may  be  found  in  Hicks7*®*^*1®  and  Ransom  and  Hicks*1.  (One  of 
the  important  problems  in  reactor  safety  is  the  need  for  fast  simulators  and 
predictors  to  assist  operators  in  handling  situations  such  as  the  event  at 
Three  Mile  Island.)  The  Harl ow-Johnson  rezoning  technique  can  be  modified  to 
del  iver  dynamic  rezoning  (also  known  as  adaptive  mesh)  methods.  Finally, 
extensions  to  multidimensions  may  be  achieved  by  operator  splitting. 

Outline  of  this  report.  While  the  development  of  hydrocodes  is  a  long 
standing  achievement,  it  is  the  utilization  of  new  and  unique  advances  in 
computer  architecture  for  hydrocode  calculations  that  makes  our  research 
important  and  timely.  In  each  of  the  three  steps,  the  crucial  questions 
concern  the  extent  to  which  the  algorithms  can  be  separated  into  calculations 
that  are  performed  concurrently.  The  algorithmic  details  are  developed  in 
Sections  III  and  IV.  Converting  to  Eulerian  coordinates  is  the  topic  of  Sec¬ 
tion  V.  The  computational  results  are  presented  in  Section  VI.  This  includes 


3 


verification  of  the  hydrodynamic  simulations  and  analysis  of  the  computational 
speed-up  as  a  function  of  the  amount  of  parallelism. 

The  design  of  a  parallel  algorithm  is  interrelated  with  the  particular 

1  O  O 

architecture  of  the  parallel  processor.  That  is,  although  some  progress 

has  been  made,1^  general  specifications  for  machine-independent  algorithms 
have  not  yet  been  agreed  upon.  In  Section  II,  we  begin  by  providing  a 
description  of  the  architectural  implications  for  parallel  algorithms  on  the 
HEP  computer. 

Section  II  -  NINO  Architecture 

HEP  hardware.  There  are  three  primary  categories  for  computer  architec¬ 
tures  according  to  their  parallel  processing  capabilities:  SISD,  SIMD,  and 
MIMD.  SISD  stands  for  Single  Instruction  stream.  Single  Data  stream.  The 
typical  serial  computer  has  SISD  architecture.  SIMD  stands  for  Single 
Instruction  stream.  Multiple  Data  stream.  Vector  machines  such  as  the  CRAY-1 
have  this  architecture.  The  multiple  data  streams  consist  of  the  components 
of  the  vectors.  The  instruction  mode  is  still  single  stream  (or  serial) 
although  the  instructions  may  generate  vector  operations.  MIMD  stands  for 
Multiple  Instruction  stream.  Multiple  Data  stream.  The  HEP  (Heterogeneous 
Element  Processor)  by  Denelcor  has  MIMD  architecture. 

The  HEP  computer  is  designed  to  combine  from  one  up  to  16  Process  Execu¬ 
tion  Modules  (PEM's)  in  a  single  computer.  Each  PEM  is  an  eight-segment  pipe¬ 
lined  processor  consisting  of  eight  Function  Units  in  the  Instruction  Process¬ 
ing  Unit.  The  Function  Units  include  the  hardware  for  initiating  parallel 
processing  and  for  an  integer  addition,  a  floating  point  addition,  a  multipli¬ 
cation,  and  several  divides. 

With  every  cycle  of  the  machine  (100  nanoseconds),  one  of  the  Function 
Units  can  undertake  a  new  calculation.  Thus,  multiple  pieces  of  data  can  be 
in  the  pipeline  of  a  PEM  simultaneously.  The  calculations  are  the  result  of 
instructions  selected  from  different  instruction  streams. 

Within  a  PEM,  as  many  as  50  (the  standard  software  limit  or  64  the  hard¬ 
ware  limit)  processes  or  lists  of  instructions  can  be  making  computational 
progress  at  the  same  time.  While  the  active  processes  in  a  PEM  are  each 
assigned  their  own  sets  of  registers  and  data  memory,  they  can  also  access 
shared  memory.  It  is  this  ability  to  synchronize  multiple  processes  from  the 


same  job  or  task  that  makes  the  HEP  computer  a  true  MIMD  machine.  More 
information  on  the  HEP  architecture  and  on  its  applications  can  be  found  in 
the  references  by  Smith^  and  Jordan.^ 

HEP  software.  The  HEP  computer  achieves  its  parallel  processing 
capabilities  by  extending  FORTRAN  77  in  two  ways.*  With  their  first  implemen¬ 
tation  of  a  compiler,  Denelcor  provided  the  CREATE  statement  and  the  asynchro¬ 
nous  variable.  In  their  more  recent  release,  they  have  replaced  these  with 
callable  subroutines  which  make  the  FORTRAN  portable.  Even  though  portability 
is  a  good  objective,  the  authors  are  disappointed  by  this  change  of  heart  at 
Denel  cor . 

Although  the  subroutine  calls  serve  the  same  purposes,  the  CREATE  state¬ 
ment  and  the  asynchronous  variable  gave  the  programmer  more  explicit  language 
for  coding  calculations  that  are  intended  to  be  performed  concurrently.  In 
any  event,  a  discussion  of  these  two  terms  is  appropriate  for  this  report 
because  it  provides  a  good  explanation  of  the  considerations  that  must  be 
taken  into  account  when  processing  is  divided  into  and  reunited  from  parallel 
paths. 

1.  The  CREATE  SUBROUTINE  statement  is  similar  (syntactically)  to  the  well- 
known  CALL  SUBROUTINE  statement  in  FORTRAN.  It  has  the  effect  of 
creating  a  copy  of  (or  "cloning")  the  original  subroutine  and  executing 
the  copy  in  a  cal cul ational  stream  parallel  to  the  mainstream. 

2.  The  asynchronous  ("doll ar-sign")  variable  is  any  acceptable  FORTRAN  vari¬ 
able  name  prefixed  with  a  "$".  Asynchronous  variables  are  used  for 
communication  between  parallel  computational  streams.  Asynchronous  vari¬ 
ables  have  two  states,  "full"  and  "empty".  If  a  FORTRAN  assignment 
statement  contains  an  asynchronous  variable  on  the  right  hand  side  of  the 
equal  sign,  then  the  calculation  waits  until  the  state  of  the 
asynchronous  variable  is  full.  If  its  state  is  full,  then  the  value  is 
fetched  and  the  state  is  set  to  empty.  If  the  left  hand  side  of  the 
equal  side  of  a  FORTRAN  assignment  statement  is  an  asynchronous  variable, 
then  the  assignment  of  its  value  waits  until  its  state  is  empty.  Then, 
when  the  assignment  is  made,  its  state  is  reset  to  full. 

The  diagram  in  Figure  1  presents  a  visual  image  of  the  parallel  process¬ 
ing  as  implemented  on  the  HEP  computer.  At  CREATE  statements,  the 


*  The  work  for  this  contract  was  implemented  in  the  FORTRAN  language. 
Denelcor  has  also  developed  the  parallel  processing  features  described 
in  this  section  for  the  scientific  programming  language  C. 
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main  stream 


parallel  streams 


•  CREATE  statement 

•  RETURN  statement. 

•  Asynchronous  variable 

•  Asynchronous  variable 

emptied. 

filled. 

Figure  1.  Fork  and  join  points  in  a  FORTRAN  program  mark  the  beginning  and 
end,  respectively,  of  concurrent  (or  parallel)  processing  segments  of  the 
code. 

computational  flow  can  "fork"  into  a  number  of  parallel  paths  which  the 
operating  system  assigns  to  the  available  processors.  An  empty  asynchronous 
variable  prevents  the  main  stream  from  continuing  beyond  the  point  at  which  it 
needs  the  results  from  the  parallel  streams.  A  "join"  point  is  marked  by  the 
return  from  a  CREATE  statement  and  an  asynchronous  variable  that  must  be  reset 
to  full  by  the  last  parallel  stream  to  finish  processing. 

One  of  the  crucial  aspects  of  parallel  processing  is,  of  course,  the 
development  of  software  capable  of  coordinating  concurrent  computational 
tasks.  Denelcor  has  chosen  a  straightforward  extension  of  FORTRAN  for  the 
HEP.  The  "fork"  and  "join"  procedures  make  the  HEP  computer  immediately 
accessible  to  the  traditional  scientific  programming  community.  A  collection 
of  lectures  on  various  aspects  of  concurrent  computation  contains  further 
background  material  on  parallel  processing  algorithms  and  architecture.15  The 


6 


HEP  FORTRAN  77  User's  Guide  is  useful  for  more  specific  information  on  the  HEP 
software.1® 

Section  III  -  Parallel  Algorithms  —  Explicit  Lagrangian 

The  first  step  of  the  research  was  the  application  of  the  parallel 
processing  attributes  of  the  HEP,  as  described  above,  to  an  explicit,  one¬ 
dimensional  Lagrangian  hydrodynamics  simulation.  In  this  case,  it  is  a 
straightforward  exercise  to  solve  the  difference  equations  in  a  zone-by-zone 
parallelization.  When  the  number  of  processes  is  significantly  less  than  the 
number  of  zones,  it  is  necessary  for  the  sake  of  efficiency  to  collect  contig¬ 
uous  zones  into  blocks.  The  block  size  or  granularity  is  simply  the  number  of 
zones  divided  by  the  number  of  processes. 

Conservation  equations.  A  hydrodynamic  simulation  is  based  on  the 
conservation  laws  of  volume,  mass,  momentum,  and  energy.  In  the  Lagrangian 
frame,  the  corresponding  differential  equations  are  differenced  on  a  mesh  in 
which  the  grid  points  remain  fixed  in  the  material.  Let  X  be  the  Lagrangian 
spatial  coordinate  (the  one  that  identifies  the  initial  location  of  the  mass 
point)  and  let  v  be  the  mass  coordinate  with  units  of  mass  per  area.  The  two 
are  related  by 

du  =  p°  dX  (1) 

where  p  is  the  mass  density  as  a  function  of  X  and  the  zero  superscript  indi¬ 
cates  p  is  evaluated  at  time  t  =  0.  Let  V  be  the  specific  volume  such  that 

(2) 

i 

This  choice  of  coordinates  guarantees  conservation  of  mass.  ] 

Assuming  rectangular  symmetry  (slab  geometry),  the  remaining  conservation 
laws  in  differential  form  are 

•  Conservation  of  volume: 


3V.3u 

3t 


(3) 
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•  Conservation  of  momentum: 


3u 

3t 


a 

an 


(p  +  q) 


(4) 


•  Conservation  of  energy: 

H =  "  lir  Eu  (p  + 


(5) 


where  u  is  the  velocity  (or  specific  momentum)  of  a  fixed  point  in  the  mass,  p 
is  the  pressure,  q  is  the  material  viscosity,  and  E  is  the  total  specific 
energy.  In  vector  notation  this  system  can  be  written  as 


au  _  3F(U) 
at  "  ay 


(6) 


where 


U  =  (V,  u,  E)T 


(7) 


and 


F(U)  =  (u,  -(p  +  q),  -u  (p  +  q))T.  (8) 

As  usual,  the  superscript  T  denotes  the  transpose  of  the  array. 

If  e  is  the  specific  internal  energy,  then 

E  =  e  +  \  u2  .  (9) 

Taking  the  partial  derivative  of  equation  (9)  with  respect  to  time  and  substi¬ 
tuting  equations  (3),  (4),  and  (5)  into  the  result  yields  an  internal  energy 
equation 


<10> 

in  place  of  equation  (5).  This  system  of  equations  (3),  (4),  and  (5)  or  (10) 
is  incomplete  without  an  equation  for  the  pressure.  The  system  is  closed  with 
a  thermomechanical  equation  of  state  (EOS)  relating  p  to  V  and  e. 
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Finite  difference  equations.  The  two  most  common  finite  differencing 
schemes  are  the  von  Neumann-Richtmyer  and  the  Lax-Wend roff . ^  The  former  was 
preferred  for  this  work  because  it  appears  to  have  good  accuracy  and  less 
tendency  to  oscillate  in  response  to  strong  shocks  and  rarefactions.*0  For 
the  discretization  of  the  time  and  mass  per  area  independent  variables,  the 
usual  convention  was  adopted.  A  superscript  denotes  time  dependence  and  a 
subscript  indicates  the  location  in  the  mass  variable.  With  this  notation, 
the  half  integers  denote  time  and  mass  centering  in  the  mesh.  Thus,  the  von 
Neumann-Richtmyer  discretization  scheme  for  the  conservation  laws  becomes19 


Conservation  of  volume: 

Vn+1 

J+l/2 

^.n+1 


/"  un+1/2  .  un+1/2 
J+l/2  _  Uj+1  Uj 


Conservation  of  momentum: 


n+1/2  _  n-1/2 

J _ J _ 

tn+l/2  tn-l/2 


(pj+l/2  *  qj+i/j) 


n  +  n-1/2  -j 
3j  -1/2  +  qj-l/2J 


"3+1/2  *  wj-l/2 


•  Energy  equation: 


n+1  n  rl  /-n+1  u.  i  A  „n-l/2i run+i  ,,n  ^ 

j+l/2  ‘  j+l/2  "  "  W  ^ P j  +1/2  +  p j  +1/2 qj+l/2llVj+l/2  '  Vj+l/2^  *  (13) 


n-1/2 -i  fwn+1 


The  EOS  expresses  Pj+i/2  in  terms  °f  ^j+1/2  anc*  ej+l/2’ 

When  real  viscous  effects  are  negligible  or,  at  least,  insufficient  to 
handle  severe  gradients  in  the  physical  properties  of  the  material,  an 
artificial  viscosity  is  superimposed  to  average  the  abrupt  variations  over 
several  adjacent  zones.  A  well  known  implementation  of  artificial  viscosity 
is  given  by  a  formula  due  to  von  Neumann-Richtmyer  as  modified  by  Rosenbluth1 
and  Landshoff .23 


qj+l/2 


if  uj+1  -  Uj  >  0 


(14a) 


qj+l/2  =  '  Aj+l/2iuj+l  "  uj 


if  uj+1  -  Uj  <  0 


(14b) 
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Figure  6.  The  bl  ock-by-bl  ock  algorithm  for  the  explicit  Lagrangian  calcula¬ 
tions  achieves  a  speed-up  factor  of  eight  to  nine  between  10  and  14  processes 


Figure  7.  The  bl  ock-by-bl  ock  algorithm  for  the  implicit  Lagrangian  calcula¬ 
tions  achieves  a  speed-up  factor  of  eight  to  nine  between  10  and  14  processes 


hardware  is  saturated,  the  processing  time  for  n  parallel  processes  is 
approximated  by 


'("I'VfV'V’i  <36> 

where  T0(n)  is  the  CREATE  overhead  for  n  parallel  processes  and  K  is  the 
number  of  times  that  forking  and  joining  occur  in  the  execution  sequence. 

Thus,  the  last  term  accounts  for  the  parallel  processing  overhead,  that  is, 
the  computer  time  lost  to  the  creation  of  the  parallel  processes  and  to  the 
communication  between  them. 

The  curve  described  by  equation  (36)  qualitatively  agrees  with  the  plots 
of  the  data  in  the  timing  Figures  6  through  9  up  until  saturation  of  the 
hardware  occurs.  At  that  point  the  term  1/n  Tp  in  the  equation  is  replaced  by 
a  constant.  Then,  the  data  shows  very  slight  linear  increases  in  processing 
time  with  additional  processes  beyond  the  hardware  saturation  point.  Thus, 
the  effect  of  the  term  K  TQ(n)  is  a  linear  increase  in  processing  time. 

The  results  of  the  bl ock-by-bl ock  calculations  as  shown  in  Figures  6,  7, 
and  8  indicate  that  the  speed-up  factor  peaks  at  a  value  between  eight  and 
nine  somewhere  in  the  range  of  10  to  14  processes.  The  speed-up  factor 
approaches  this  value  because  there  are  essentially  nine  segments  in  the 
calculations:  eight  in  the  pipeline  plus  the  store  operation.  This  peaking 
phenomena  in  the  range  of  10  to  14  processes  has  been  observed  by  several 
other  investigators.^*'*’’^ 

The  timing  date  for  the  implicit  Eulerian  calculations  as  displayed  in 
Figure  8  was  collected  after  the  installation  of  the  new  Mass  Storage  System 
(MSS)  and  additional  nodes  in  the  switch  network  on  the  HEP  computer.  The 
result  had  the  effect  of  an  Eulerian  routing  through  the  switch  network  which 
had  been  upgraded  from  two  to  eight  nodes.  The  longer  path  for  accessing  data 
caused  a  severe  increase  in  memory  latency  which  translates  into  a  smaller 
utilization  percentage  of  the  pipeline  by  a  single  process.  Thus,  the  data  on 
Figure  9  illustrate  the  performance  of  a  multi  process  code  spread  over  several 
PEM's.  However,  the  same  timing  model  still  applies. 
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For  the  explicit,  Lagrangian  equations  (11)  and  (12)  with  equations  (15), 
(17),  and  (24),  self-scheduling  processes  were  written  and  implemented  as 
f  ol  1  ows 

Fork,  compute  time  steps,  join; 

Fork,  compute  viscosities,  join; 

Fork,  compute  velocities,  join; 

Fork,  compute  volumes,  join;  and 

Fork,  compute  pressures,  join. 

For  the  zone-by-zone  algorithm,  the  self-scheduling  within  each  fork  and  join 
means  that  the  processes  integrate  the  equations  within  a  single  zone  and  then 
ask  for  another  zone  until  none  remain.  An  asynchronous  variable  keeps  track 
of  the  next  zone  for  which  processing  is  required. 

For  the  block-by-block  algorithm,  between  each  fork  and  join,  a  block  of 
Ng  contiguous  zones  is  handed  over  to  each  of  the  Np  parallel  processes.  The 
algorithm  is  prescheduled  by  setting  NpNg  equal  to  the  total  number  of 
zones.  Of  course,  the  calculations  using  the  zone-by-zone  algorithm  encoun¬ 
tered  so  much  contention  for  the  asynchronous  zone  counter  that  they  proved  to 
be  quite  inefficient.  This  approach  was  quickly  dropped  in  favor  of  the 
prescheduled  blocks  of  contiguous  zones. 

Figures  6  through  9  show  the  computation  time  and  speed-up  factors  versus 
the  number  of  processes.  All  cases  were  run  with  2520  (the  least  common 
multiple  of  all  positive  integers  less  than  or  equal  to  ten)  zones  for  400 
time  steps. 

Mathematical  model  for  the  computation  time.  The  data  shown  in  these 
figures  can  be  interpreted  according  to  the  following  model  suggested  by 
discussions  in  articles  by  Buzbee,*  Larson,^  and  Flatt.^2  Consider  a 
computer  program  that  consumes  a  total  processing  time  of 


T 


t 


TS  +  TP 


(35) 


where  Tp  is  the  total  processing  time  spent  on  calculations  which  can  be 
divided  into  parallel  processes.  Ts  is  the  processing  time  consumed  by  the 
calculations  that  are  performed  serially  when  the  program  is  executed  in 
either  serial  or  parallel  mode.  Then,  up  until  the  point  at  which  the 
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characteristics  of  the  problem.  Hence,  the  finite  difference  scheme  inte¬ 
grates  the  equations  exactly. ^ 

Two  anomalies  occur  with  the  implicit  versions  of  the  code.  First,  as 
discussed  in  Section  IV,  the  loss  of  accuracy  due  to  the  implicit  differencing 
affects  the  results  in  the  same  way  as  an  artificial  viscosity  term.  So  the 
material  discontinuities  are  smeared  over  several  zones.  As  expected,  in  the 
completely  implicit  (8  *  1  in  equation  (30))  Lagrangian  and  Eulerian  simula¬ 
tions,  the  rarefaction  and  shock  are  smeared  over  several  adjacent  zones. 
Second,  since  the  explicit  solutions  along  the  boundary  seams  are  of  higher 
order,  discrete  points  of  greater  accuracy  are  introduced  in  the  material. 

The  overall  stability  of  the  implicit  scheme  holds  the  solution  close  to  these 
accurate  boundary  conditions  within  each  block  of  zones. 

All  four  differencing  schemes  have  been  coded  into  a  single  program  in 
which  the  choice  of  Lagrangian  or  Eulerian  and  a  value  of  8  in  the  momentum 
equation  (30)  are  selected  at  run  time.  A  large  number  of  samples  of  the 
output  from  many  hours  of  HEP  computer  time  is  displayed  in  the  Appendix.  It 
includes  spatial  plots  of  the  pressure  at  various  times  during  the  simulation 
of  the  flow  of  the  material  in  the  test  problem  pipe.  For  each  pressure 
profile,  the  results  are  shown  from  simulations  with  one,  two,  and  six 
processes. 

As  an  illustration  of  other  aspects  of  the  computer  program,  one  pressure 
profile  is  displayed  with  the  corresponding  plots  of  material  velocity  and 
mass  density.  These  are  also  shown  for  one,  two,  and  six  processes.  The 
velocity  graph  is  a  nice  illustration  of  the  increasing  momentum  of  the 
material  as  it  bursts  out  of  the  left  half  of  the  pipe. 

Tiaing  results  on  the  HEP.  Computer  codes  have  been  written  for  each  of 
the  four  hydrodynamic  simulations:  explicit  and  implicit,  Lagrangian  and 
Eulerian  finite  differences.  To  demonstrate  the  improvements  in  computational 
efficiency,  the  results  are  discussed  in  this  section  for  the  four  versions  of 
the  hydrocode  with  a  linear  pressure-vol  ume  material  law.  The  simple  test 
problem  for  which  the  exact  solution  is  known  involves  shock  and  rarefaction 
waves.  The  initial  conditions  are  an  ideal  pressure  and  density  shock  in  the 
interior  of  a  motionless  slab.  The  boundaries  are  held  fixed  and  the  shock 
moves  forward  with  a  rarefaction  proceeding  in  the  opposite  direction.^® 
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Thus,  if  u"+1/2  >  0,  for  example,  then  inequality  (33)  reduces  to 

J 

u0"*1/2  it  <  x$+1  -  xj  .  (34) 

Constraints  of  this  form  are  reminiscent  of  the  CFL  restriction  and  are  some 
times  referred  to  as  the  material -Courant  or  the  mass-flow-Courant  restric¬ 
tions. 

This  rezoning  algorithm  automatically  conserves  volume  and  mass. 

Momentum,  internal  energy,  and  kinetic  energy  are  each  conserved  individually 
according  to  a  step  function  or  a  linear  interpolation  for  the  accumulated 
mass  per  area.  The  derivation  of  such  models  can  be  found  in  Hicks  and 
McGrath.*^  The  rezoning  step  is  included  following  the  calculation  of  the  new 
energies.  Therefore,  the  momentum  is  reassigned  at  time  tn+^^  and  the 
volumes,  pressures,  and  energies  at  time  tn+1. 

Section  VI  -  Computational  Results 

The  hydrocode  written  for  this  contract  has  been  exercised  extensively 
for  two  reasons.  First,  the  simulations  have  to  be  verified  against  hydro- 
dynamic  test  problems.  Second,  and  more  to  the  point  of  the  research  objec¬ 
tives,  the  speed-up  achieved  by  the  parallel  algorithms  has  to  be  analyzed. 

For  the  first  of  these  purposes,  the  four  versions  of  the  code  were  tested  on 
a  physical  problem  for  which  the  analytic  solution  is  known. 

Rarefaction-shock  simulation.  A  gas  at  standard  temperature  and  pressure 
is  contained  in  the  right  half  of  a  pipe.  The  material  gas  under  otherwise 
identical  conditions  has  twice  the  density  and  a  pressure  of  two  atmospheres 
in  the  left  half  of  the  pipe.  At  time  zero  the  material  is  allowed  to  flow. 

A  shock  moves  to  the  right  and  a  rarefaction  to  the  left.  They  bounce  off  the 
ends  of  the  pipe  and  return  to  interact  in  the  center.  Thus,  this  single 
physical  problem  tests  the  code  on  its  ability  to  simulate  shocks,  rarefac¬ 
tions,  and  interactions  between  them. 

If  a  linear  approximation  (17)  to  an  isentropic  equation  of  state  is  used 
to  compute  the  pressures,  both  the  shock  and  the  rarefaction  move  as  square 
waves  in  the  absence  of  viscosity.  The  explicit  Lagrangian  simulation  repro¬ 
duced  this  behavior  identically  when  the  Courant-Friedrichs-Lewy  condition  is 
equal  to  one.  In  this  case,  the  information  flow  in  the  material  is  along  the 
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In  the  Lagrangia  advance  the  zone  boundary  moves  from  x?  to  x?+*  for 

J  J 

each  j.  The  top  part  '  Figure  5  shows  the  mass  moving  to  the  right.  Then, 
from  the  bottom  part  of  the  figure,  we  see  that  the  rezoning  step  transfers 
mass  from  zone  [Xj.j,  Xj]  to  zone  [xj ,  Xj+1].  Similarly,  momentum  and  energy 
are  transferred  from  one  zone  to  the  next.  Careful  bookkeeping  must  be  main¬ 
tained  to  ensure  conservation  of  these  quantities. 

Move  the  mesh. 


Rezone  to  the  previous  location. 

Figure  5.  The  Harl ow-Johnson  method  rezones  the  mesh  back  to  its  previous 
location  after  each  Lagrangian  time  step. 

Material -Courant  condition.  Implicit  here  is  the  assumption  that  zone 
boundaries  do  not  travel  further  than  a  zone  width.  That  is, 

xn  <  xn+1  <  xn  (3 1 

xj-l  xj  Xj+1  * 

If  this  constraint  is  violated  then  the  rezoning  gets  a  bit  more  complica¬ 
ted.  The  inequality  (32)  is  equivalent  to 


xD  ,  <  x1)  +  At  uTl/Z  <  x"  . 


(33) 


Boundary  seam 

t 


Figure  4.  The  domain  of  dependence  for  the  velocities  has  a  half-width  of 
three  zones  when  three  explicit  time  steps  6t  are  required  to  integrate  the 
equations  from  time  tn"^  to  time  tn+*^. 

Section  V  -  Eulerian  Coordinates 

Harl ow -John son  rezoning.  Once  a  Lagrangian  hydrocode  has  been  construe 
ted,  it  can  be  converted  to  an  Eulerian  code  by  making  use  of  the  Harlow- 
Johnson  rezoning  method.5  Figure  5  illustrates  the  technique.  An  Eulerian 
calculation  is  achieved  in  two  steps.  The  first  is  a  Lagrangian  calculation 
and  the  second  is  a  rezoning  of  the  mesh  back  to  its  original  location.  Sin 
the  rezoning  leaves  the  mesh  points  fixed  in  space,  the  calculation  is 
Eulerian.  Another  way  of  viewing  this  rezoning  scheme  is  from  the  point  of 
view  of  operator  splitting.  That  is,  the  discretization  of  the  Eulerian 
operator  3/3t  +  u3/3x  Is  split  Into  the  advance  of  the  Lagrangian  part 
( 3/3t)  followed  by  the  advance  of  the  convective  part  (u3/3x). 


developed,  this  approach  to  parallel  algorithms  for  tridiagonal  linear  systems 
(or,  more  generally,  for  the  evaluation  of  recursive  sequences)  has  been  laid 
aside  for  future  examination. 

The  block  implicit  technique.  For  hydrocode  calculations,  the  original 
system  of  differential  equations  can  be  divided  into  uncoupled  tridiagonal 
systems  by  inserting  spatial  boundaries  along  which  the  values  of  the  unknowns 
are  determined  by  a  stable  explicit  scheme.  Several  explicit  steps  may  be 
required  for  each  implicit  integration  step. 

For  an  implicit  solution  in  a  hydrodynamics  simulation,  the  material  is 
divided  into  p  blocks  of  contiguous  zones.  The  program  forks  into  parallel 
paths  to  solve  the  momentum  equation  explicitly  at  each  of  the  p-1  internal 
boundary  seams.  After  synchronization,  the  program  forks  again  to  solve  the 
explicit-implicit  momentum  equation  (30)  within  each  of  the  p  blocks  of  zones 
using  the  standard  tridiagonal  algorithm.  Finally,  the  volume,  pressures,  and 
energies  are  computed  with  repeated  fork -pa rail  el -join  operations  just  as  in 
the  explicit  Lagrangian  code. 

Internal  boundary  seams.  For  the  following  discussion  of  the  details, 
assume  that  a  temporal  seam  of  boundary  conditions  is  to  be  calculated  at  the 
mass  mesh  point  uj.  Let  the  implicit  integration  step  at  time  tn  be  At. 
Suppose  that  the  CFL  stability  condition  for  the  von  Neumann-Richtmyer 
explicit  integration  requires  three  time  steps  to  advance  the  dependent 
variables  from  tn  to  tn  +  At.  Let 

fit  =  At  (31) 

and  use  equation  (12)  to  advance  u!?“j^  from  time  t t0  t-jme  tn-l/2 
+  fit  for  i  =  -1,  0,  1  as  shown  in  Figure  4.  Then,  since  the  pressures  are 
known  in  the  appropriate  zones  up  through  time  tn,  the  velocities  can  be 
advanced  from  time  tn"1//2  +  fit  to  time  tn“1//2  +  2  fit.  At  this  point,  equa¬ 
tions  (11)  and  (13)  have  to  be  integrated  in  a  similar  manner  from  time  tn  to 
time  tn  +  fit  in  the  surrounding  zones.  Finally,  the  velocity  is  integrated 

from  time  tn"^2  +  2  fit  to  time  tn"^2  +  3  fit  *  tn+^2  to  achieve  u?+*^  on 

J 

the  Internal  boundary  seam.  Figure  4  illustrates  the  domain  of  dependence 
for  Uj+^.  The  velocities,  densities,  pressures,  and  energies  are  evaluated 
explicitly  within  this  domain. 


A  mixture  of  the  explicit  and  implicit  differencing,  that  is,  a  weighted 
average  of  the  explicit  and  the  implicit  terms  was  written  into  the  simulation 
codes  in  order  to  permit  continuous  variation  between  the  completely  explicit 
and  the  completely  implicit  solutions.  The  blended  form  of  the  momentum 
equation  (12)  is 


un+l/2,  uii-l/2 
tn+l/2_  tn-l/2 


n+l/2_  n+1/2 
pi+l/2  "i-1/2 


(30) 

As  pointed  out  in  Section  2.2  and  proven  in  the  appendix  of  an  INEL  report,^ 
only  the  pressure  term  in  the  momentum  equation  has  to  be  pushed  forward  in 
time  for  the  purely  mechanical  system  to  achieve  unconditional  stability. 
Specifically,  the  system  of  equations  (11)  and  (30)  with  the  mechanical 
equation  of  state  (17)  is  unconditionally  stable  for  0  «  1. 

An  operator  splitting  approach  was  mapped  out  to  extend  the  unconditional 
stability  to  a  thermal  mechanical  system.  The  technique  separates  the  thermal 
and  mechanical  parts  of  the  rate  relation  for  the  pressure  and  alternately 
solves  the  two  energy  equations.  See  a  forthcoming  KMSF  report  on  uncondi¬ 
tionally  stable  modifications  to  the  von  Neumann-Richtmyer  differencing  scheme 
by  Hicks  and  McGrath. 

Recursive  sequences  In  parallel.  The  implicit  solution  of  coupled 
differential  equations  typically  leads  to  a  tridiagonal  linear  system  for 
which  a  parallel  algorithm  is  not  immediately  obvious.  It  is  possible  to 
solve  the  equations  in  parallel  via  an  a  priori,  symbolic  inversion  of  the 
system.^  This  involves  the  evaluation  of  several  recursive  sequences.  The 
method  divides  a  recursive  sequence  into  parallel  processes  by  the  concurrent 
evaluation  of  even  and  odd  terms  (or,  more  generally,  by  the  concurrent  eval¬ 
uation  of  the  n  sequences  of  terms  whose  indices  are  equivalent  modulo  n). 

This  is  the  technique  mentioned  in  the  proposal  for  the  research 
project.  As  discussed  in  the  proposal,  the  method  involves  some  dupl ication 
of  calculations.  However,  a  net  gain  is  realized  from  the  more  efficient  use 
of  the  parallel  processor.  Since  a  more  natural  mechanism  for  hydrocodes  was 
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Forward  differencing.  In  genera] ,  the  time  centered  quantities  should  be 
evaluated  at  the  forward  times  tn+1//2  or  tn+1  in  order  for  the  unconditional 
stability  of  an  implicit  solution  to  be  achieved.  Thus,  with  the  viscosities 
omitted,  the  implicit  form  of  equation  (12)  is 


un*l/2  _  un-l/2 
J  j 

tn+l/2  _  tn-l/2 


.n+i/2 

Pj+1/2 


,n+l/2 

3j-l/2 


Jj+l/2  '  Uj-l/2 


(26) 


The  loss  of  second  order  accuracy  introduced  by  the  implicit  solution  is 
equivalent  to  an  artificial  viscosity  term.  This  can  be  seen  from  a  two-term 
Taylor  series  expansion.  In  either  zone  j  +  1/2  or  zone  j  -  1/2, 

pn+1/2  «  p(tn  +  |  At) 


«  p(tn)  +  ^iE^I  At  +  0(At2) 


=  Pn  + 


* pf1  +  \  lu" At  +  (2?) 

p  p 

where  the  0(Att)  notation  Indicates  that  terms  of  order  At*-  or  of  higher  order 
have  been  neglected. 

Substituting  equation  (27)  into  equation  (26)  yields 


where 


un+l/2  _  un-l/2 

J  _ J 

tn+l/2  _  ^n-1/2 


r,n  nn 

PJ+l/2  ~  Pj-l/2  _  Qj 

uj+l/2  "  pj-l/2  Mj+l/2  "  1/2 


(28) 


n 


<fi  - 1 «  t(^) 

J  £  dV  J+1/2  V 


U"  -  U1? 
J+l  J 


n  ,,n  ,.n 

Jn  n  u.  -  u. 

_  J 

W  M/2  UJ  "  Vl 


(29) 


which  has  the  form  of  the  artificial  viscosity  q  as  defined  by  equa¬ 
tions  (14).  Artificial  viscosity  methods  are  employed  to  smooth  material 
discontinuities  over  several  spatial  zones.  The  above  analysis  shows  that  the 
need  for  this  technique  in  the  explicit  differencing  is  removed  from  the 
implicit  differencing. 
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The  modified  stability  condition  that  takes  the  magnitude  of  the  viscosity 
into  account  is^ 


At  <  min  — 
j  a' 


(24) 


where 


a'  .  a  (y  +  [(j)2  +  l]1/2> 


(25) 


Two  of  the  references  by  Hicks  contain  a  more  general  discussion  of  hydrocode 
convergence  probl ems.2®*^ 

Parallel  segments  of  the  code.  To  perform  the  explicit  Lagrangian 
integration,  subroutines  were  constructed  to  compute 

1.  The  explicit  time  step  from  the  inequality  (24), 

2.  The  viscosities  from  equations  (14), 

3.  The  momentum  from  equation  (12), 

4.  The  zone  volumes  from  equation  (11),  and 

5.  The  pressures  and  internal  energies  from  equation  (17),  (18),  or  (19)  and 
equation  (13). 

The  main  program  CREATES  the  designated  number  of  copies  n  of  the  first 
routine.  It  uses  the  fork  and  join  structure  of  parallel  programming  to 
compute  the  minimum  time  step  for  each  of  n  different  blocks  of  zones.  The 
minimum  of  these  values  is  selected  for  the  program  time  step.  Then,  in 
similar  fashion,  the  program  repeats  the  fork-parallel -join  procedure  with  the 
same  block -by-block  structure  for  each  of  the  four  remaining  sets  of  calcula¬ 
tions.  These  kinds  of  parallel  programming  tasks  are  described  further  in  a 
couple  of  the  references. 

Section  IV  Parallel  Algorithms  —  Implicit  Lagrangian 

The  object  of  implicit  calculations  for  a  hydrocode  is  the  relaxation  of 
restrictions  on  the  time  step.  Explicit  time  steps  are  usually  constrained  by 
the  CFL  (Courant,  Friedrichs,  Lewy)^*22  condition  that  is  required  for 
stability  as  discussed  above.  The  explicit  solution  advances  the  components 
of  equation  (6)  from  time  tn"*/2  to  time  tn+*/2  or  from  time  tn  to  time  tn+* 
in  terms  of  quantities  evaluated  at  times  tn  or  tn+*/2,  respectively.  Such  a 
centered  difference  scheme  yields  second  order  accuracy. 


Figure  2.  The  data  structure  for  the 
conservation  of  momentum  has  an  explicit 
form  as  stated  by  equation  (20). 


Figure  3.  The  data  structure  for 
the  conservation  of  volume  is  also 
explicit  as  stated  by  equation  (21) 


Figures  2  and  3  also  show  the  offsets  in  both  time  and  space  (mass)  between 
the  momentum  and  volume  equations  as  specified  by  the  von  Neumann-Richtmyer 
differencing  scheme. 


The  explicit  time  step  At  is  constrained  by  the  CFL  (Courant,  Friedrichs, 
Lewy)*^*22  condition  that  is  required  for  stability  and  convergence.  Various 
necessary  and  sufficient  conditions  on  the  CFL  number  in  hydrocodes  with 
classical  thermomechanical  equations  of  state  such  as  (18)  or  (19)  are 
contained  in  Richtmyer  and  Morton^  while  some  recent  results  for  rate 
dependent  and  related  computational  techniques  such  as  subcycling  have  been 
developed  by  Hicks. 23.24 


The  CFL  stability  condition  requires 


a 


n+1/2 
j  +1/2 


tn) 


Wj+1 


for  every  zone.  That  is. 


(22) 


At  <  min  —  . 
j  a 


(23)  i 
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where  r  >  0  is  the  Gruneisen  parameter.  This  form  is  often  used  in  research 
involving  shock  waves  in  solids. 


For  the  above  equations  of  state  (17),  (18),  and  (19),  it  is  easy  to 
reduce  the  von  Neumann-Richtmyer  implicit  discretization  of  the  internal 
energy  equation  (13)  to  an  explicit  expression  for  ejti/2*  the  E0S  1S 
purely  mechanical  as  in  equation  (17),  meaning  p  depends  only  on  V,  then  both 
the  energy  equation  (13)  and  the  EOS  are  not  needed  to  complete  the  system  and 
they  may  be  omitted  altogether. 

One  of  the  advantages  of  the  parallel  computer  (MIMD  architecture)  over 
the  vector  computer  (SIMD  architecture)  pertains  to  the  parallelization  of  the 
material  law  routine.  These  calculations  do  not  in  general  vectorize. 

Indeed,  in  certain  cases,  a  large  portion  (often  over  75%)  of  the  processing 
time  is  spent  on  computationally  intensive  material  laws. 

More  complicated  situations  occur  when  the  EOS  is  not  as  straightforward 
as  equations  (17),  (18)  or  (19)  and  when  energy  transport  mechanisms  such  as 
conduction  become  important.  In  such  cases,  equation  (13)  with  additional 
terms  for  energy  transport  mechanisms  must  be  solved  implicitly.21  A  discus¬ 
sion  of  parallel  algorithms  for  implicit  equations  is  contained  in  the  follow¬ 
ing  paragraphs.  More  general  energy  equations  are  beyond  the  scope  of  this 
article.  An  examination  of  these  problems  is  a  natural  extension  that  the 
authors  would  like  to  promote. 

Data  structure  of  the  equations.  The  time  and  space  structure  of  the 
data  as  it  has  just  been  described  leaves  it  in  a  form  that  is  immediately 
amenable  for  parallelization.  The  data  structure  for  the  conservation  of 
momentum  is  illustrated  in  Figure  2  while  the  conservation  of  volume  is  shown 
in  Figure  3.  The  discrete  conservation  of  momentum  equation  (12)  has  the  form 

unew  uold  "  bright  "  °left^  r 

where  r  3  At/Ay  and  a  *  p  +  q.  Similarly,  the  discrete  conservation  of  volume 
equation  (11)  has  the  form 


new 


'old 


+  (u 
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where  the  coefficient  of  artificial  viscosity  is 

Aj+l/2  =  C£  aj+l/2  +  cq  pj+l/2 j+1  "  ujl  *  (15) 

The  coefficients  of  the  linear  and  quadratic  terms  are  adjustable  parameters 
for  particular  applications  c£  is  of  order  unity  and  Cq  of  order  one  tenth. 

The  acoustic  impedance  is 

a  =  c$p  (16) 

where  c$  is  the  sound  speed  of  the  material. 

The  superscripts  in  the  difference  equations  (11),  (12),  and  (13)  indi¬ 
cate  the  order  in  which  they  are  solved.  First,  the  velocity  u  is  updated 
from  time  tn_1/2  to  time  tn+^2  using  equation  (12)  where,  in  practice,  it  has 
not  been  found  necessary  to  extrapolate  the  viscosity  q  to  time  tn.  Then,  in 
a  leapfrog  fashion,  equation  (11)  is  employed  to  vault  the  specific  volume  V 
over  the  velocity  from  time  tn  to  time  tn+1.  Finally,  all  that  remains  is  the 
energy  equation  (13)  in  which  compression  and  viscous  work  contribute  to  the 
heating  of  the  material. 

Equation  of  state.  If  the  equation  of  state  has  a  tractable  analytic 
expression  for  the  pressure,  it  can  be  substituted  into  the  energy  equa¬ 
tion  (13).  Then,  upon  rearranging  terms,  the  internal  energy  e  is  advanced 
from  time  tn  to  tn+*.  One  such  form  is  a  tangential,  linear  approximation  to 
an  isentropic  equation  of  state 

p  -  p0  *  -  ao2(V  ■  Vo>  <17) 

where  a0  is  the  reference  acoustic  impedance  of  the  material.  Another  is  the 
ideal  gas  law 

p  =  (y  -  1)  e  p  (18) 

where  r  is  the  ratio  of  specific  heats  Cy  to  cj.  A  general  izat ion  of  the 
ideal  gas  law  is  the  Mie-Griineisen  law 

p  =  f(p)  +  r  e  p  (19) 


( 


10 


"5T  400 


0 

2  200 

to 

to 

Ui 

o 

2  100 
0. 


2520  Zones 
400  Time  steps 


5  10  15  20 

NUMBER  OF  PROCESSES 


0  5  10  15  20 

NUMBER  OF  PROCESSES 


Figure  8.  The  block -by -block  algorithm  for  the  explicit  Eulerian  calculations 
achieves  a  speed-up  factor  of  eight  to  nine  between  10  and  14  processes. 
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Figure  9.  The  block -by -block  algorithm  for  the  implicit  Eulerian  calculations 
achieves  a  speed-up  factors  approaching  20  for  computations  with  high  memory 


A  major  objective  of  this  research  project  is  the  maximization  of  the 
speed-up  factor32 


T(n)  (n-l)T  n  K  T  (n)  (37) 

1  + _ 1  + - 2 - 

Tt  Tt 

where 

Tt  *  T ( 1 )  (38) 

from  equation  (36).  Thus,  for  S(n)  and  the  efficiency32 

E(n)4s(n)  (39) 

to  be  high,  the  denominator  of  equation  (37)  has  to  be  small.  This  means 
that,  for  a  fixed  number  of  processes  n,  a  quantity  limited  by  the  hardware, 
we  need  small  values  of  Ts,  K,  and  T0(n). 

The  parallel  overhead  T0(n)  is  a  machine-dependent  quantity.  It  appears 
to  be  proportional  to  n  as  shown  above.  While  this  is  true  of  other  machines 
as  well,32  it  can  also  be  proportional  to  log  n.31  In  any  case,  the  values  of 
Ts  and  K  are  subject  to  the  effectiveness  of  the  pf-allel  algorithms.  Thus, 
for  a  given  machine,  it  is  important  to  reduce  Ts  and  K  as  much  as  possible  in 
order  to  achieve  the  highest  speed-up  factor  and  the  greatest  efficiency. 

For  the  hydrocode  timing  data,  equation  (36)  can  be  rewritten  as 

T(n)  -  i  +  (1  -  i)  Ts  ♦  K  CQ(n  -  1)  (40) 


where  T$  and 


T0(n)  *  CQ(n  -  1)  (41) 

have  been  normalized  to  the  total  time  Tt.  Also,  equation  (35)  was  used  to 
eliminate  Tp  which  again  illustrates  that  the  timing  results  depend  only  on 
the  percentage  of  the  calculations  done  serially  and  on  the  overhead  term. 


25 


An  analysis  of  the  analytic  continuation  of  equation  (40)  to  a  differen¬ 
tiable  function  shows  that  the  speed-up  factor  (37)  and  the  efficiency  (39) 
achieve  a  maximum  for 


max 


-  4^)1/2 


(42) 


which  yields  the  minimum  normalized  computing  time 


T  .  =  T  +  2(1-Tj1/2  (K  Cj1/2  -  K  C 

min  s  v  s'  '■O''  o 


(43) 


by  substituting  equation  (42)  into  equation  (40). 

Least  squares  approximations  that  fit  equation  (40)  to  the  data  displayed 
on  Figures  6  through  9  yield  values  for  Ts  and  C0  on  the  order  of  one  percent 
and  10~7  to  10"®,  respectively.  These  small  values  for  the  fraction  Ts  of  the 
computer  time  that  is  spent  in  serial  mode  are  an  indication  that  the  speed-up 
factors  S(n)  are  nearly  optimal. 

The  least  squares  approximations  were  calculated  to  fit  the  data  between 
one  and  seven  to  nine  processes.  The  corresponding  range  of  values  for  the 
parameters  Ts  and  C0  are  displayed  in  Figure  10.  Equations  (42)  and  (43)  were 
used  to  get  the  values  of  nmax  and  Tmin.  The  achievable  improvement  in 
computation  time  is  a  tenth  of  a  percent  or  better  of  the  serial  computation 
time.  That  is,  for  an  MIMD  machine  not  limited  by  the  hardware  to  a  speed-up 
of  eight  to  nine  at  10  to  14  processes,  the  derived  algorithms  can  achieve  a 
speed-up  of  more  than  100  at  several  tens  of  processes. 
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Explicit  Lagrangian 
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l-5xl0"7 
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Implicit  Lagrangian 
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.075-. 081 
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.0054±.0002 
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Figure  10.  The  improvement  in  computation  time  Tm^n  that  can  be  achieved  with 
the  parallel  algorithms  for  computational  continuum  dynamics  is  a  tenth  of  a 
percent  or  better  of  the  serial  processing  time.  The  column  labeled  k  is  the 
number  of  fork -join  occurrences  per  time  step.  Thus,  for  these  data, 

K  *  400k. 
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A  renark  concerning  processing  overhead.  One  final  remark  concerns  the 
trade-off  between  the  CREATE  statement  and  the  asynchronous  variables.  The 
value  of  K  can  be  altered  by  reducing  the  number  of  CREATES.  The  greatest 
effect  can  be  achieved  by  eliminating  the  repeated  fork-parallel -join  approach 
employed  for  both  the  zone-by-zone  and  the  block -by -block  algorithms.  To 
accomplish  this,  the  calculations  for  viscosities,  velocities,  volumes,  and 
pressures  were  incorporated  into  a  single  subroutine.  A  program  was  written 
to  fork  into  multiple  copies  of  this  routine  at  the  beginning  of  the 
simulation.  Then  each  process  in  either  zone-by-zone  or  block -by -block 
fashion  performed  the  complete  set  of  calculations  for  that  time  step.  At  the 
end  of  the  time  step,  synchronization  was  achieved  through  an  asynchronous 
variable,  but  the  process  was  not  terminated  by  returning  to  the  mainstream. 
Instead,  all  processes  continued  with  the  next  time  step.  They  did  not  join 
until  the  integration  was  complete. 

Of  course,  the  synchronization  step  constitutes  a  "join"  in  the  proc¬ 
essing  to  be  followed  by  another  "fork".  Thus,  this  approach  did  not  provide 
the  expected  improvement  in  the  speed-up  factor.  Evidently,  the  memory 
contention  for  the  asynchronous  variable  outweighs  the  savings  in  CREATE 
overhead.  For  these  reasons,  we  believe  that  the  timing  figures  illustrate 
close  to  the  optimum  speed-up  achievable  for  these  hydrocode  calculations. 
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APPENDIX 


The  attached  figures  illustrate  the  performance  of  the  simulation  code  on 
the  test  problem  defined  in  Section  VI.  Figures  1  through  10  contain  45 
illustrations  of  pressure  profiles.  Figure  1  shows  spatial  plots  of  the 
pressure  at  five  different  times  during  the  simulation  performed  by  the 
explicit  Lagrangian  code  with  only  one  process.  Figures  2  and  3  each  show  the 
same  five  pressure  plots  generated  by  the  explicit  Lagrangian  code  with  two 
and  six  processes,  respectively. 

Figures  4  and  5  show  pressure  plots  generated  by  the  implicit  Lagrangian 
code  with  one,  two,  and  six  processes.  Similarly,  Figures  6  and  7  display 
results  for  the  explicit  Eulerian  code  while  Figures  8,  9,  and  10  are  for  the 
implicit  Eulerian  code. 

Figure  11  shows  one  more  pressure  plot  with  the  corresponding  velocity 
and  density  plots.  These  data  were  generated  by  the  explicit  Eulerian  code 
with  six  processes. 

Note  that  as  explained  in  Section  IV,  the  implicit  differencing  will 
model  the  material  behavior  less  accurately.  Also,  as  discussed  in  Sec¬ 
tions  IV  and  VI,  the  implicit  simulations  done  with  multiple  processes  should 
not  be  expected  to  produce  identical  results.  Indeed,  the  block  implicit 
algorithms  introduce  spatial  points  of  greater  accuracy. 
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Figure  1.  The  explicit  Lagrangian  simulation  models  the  rarefaction  and  shock 
as  square  waves  when  the  CFL  number  equals  one  and  the  equation  of  state  is  a 
straight  line  approximation  to  an  isentropic  ideal  gas  law. 
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Figure  3.  The  pressure  plots  remain  unchanged  when  the  explicit  Lagrangian 
simulation  is  performed  with  six  processes. 
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Figure  4.  The  Implicit  differencing  in  a  Lagrangian  simulation  introduces 
terms  that  have  the  effect  of  viscosity  on  the  material  flow. 
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Figure  5.  In  the  implicit  Lagrangian  code,  the  momentum  equation  is  integra¬ 
ted  explicitly  along  boundary  seams  that  separate  the  material  into  independ¬ 
ent  segments.  The  simulation  with  two  processes  has  one  boundary  seam  and  the 
one  with  six  has  five.  In  each  case,  the  material  is  divided  into  equal -mass 
blocks  of  zones. 
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Figure  8.  The  implicit  Eulerian  simulation  is  derived  from  the  implicit 
Lagrangian  with  a  rezoning  algorithm. 
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Figure  9.  The  implicit  Eulerian  simulation  with  two  processes  has  one 
boundary  seam  at  which  the  data  are  computed  more  accurately. 
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Figure  11.  The  large  negative  pressure  gradient  In  the  material  induces 
positive  velocities  as  the  material  bursts  out  from  the  left  half  of  the 
pipe.  These  results  were  generated  by  the  explicit  Eulerian  code  using  six 
processes. 
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